June 14, 2017

What does my group do?

  • Study the molecular basis of variation in development and disease
  • Using high-throughput experimental methods

What is epigenomics?

What makes them different?

Much human variation is due to differences in ~6 million DNA base pairs (0.1% of the genome)

Genes are expressed differently during different stages and in different tissues.

DNA is packed, making certain parts inaccessible, and this packing is dynamic.

DNA methylation is a chemical modification of DNA, involved in gene expression regulation.

[Robertson and Wolffe, Nat Rev Genet, 2000]

Probing DNA methylation

The data

Probing DNA methylation

  • local-likelihood smoothing method
  • high-frequency smoothing estimates local methylation structure (small domains)
  • low-frequency smoothing estimates long-range methylation structure (large domains)
Nature Genetics, 2011; Bioinformatics, 2013
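
As a rough illustration of the two smoothing scales, the sketch below uses base R's `loess` as a stand-in for the local-likelihood smoother. The simulated data, the variable names, and the span values are all assumptions made for illustration; they are not the published method or its defaults.

```r
# Simulated example only: positions, methylation levels, and spans are made up.
set.seed(1)
n   <- 2000
pos <- sort(sample.int(2e6, n))                  # hypothetical CpG positions (bp)

# Underlying signal: one large hypo-methylated block plus one small dip
truth <- rep(0.8, n)
truth[pos > 8e5 & pos < 1.4e6]   <- 0.5          # large domain
truth[pos > 3.0e5 & pos < 3.2e5] <- 0.2          # small domain

# Noisy per-CpG methylation estimates, clipped to [0, 1]
meth <- pmin(pmax(truth + rnorm(n, sd = 0.15), 0), 1)
d <- data.frame(pos = pos, meth = meth)

# Small span: follows local structure (small domains)
fit_local <- loess(meth ~ pos, data = d, span = 0.03, degree = 1)
# Large span: follows long-range structure (large domains)
fit_long  <- loess(meth ~ pos, data = d, span = 0.30, degree = 1)

smooth_local <- predict(fit_local)
smooth_long  <- predict(fit_long)

# Mean absolute change between neighbouring CpGs, as a crude roughness measure
roughness <- c(local = mean(abs(diff(smooth_local))),
               long  = mean(abs(diff(smooth_long))))
print(roughness)
```

The large-span fit averages over many more CpGs, so it traces the broad block and ignores the small dip, while the small-span fit can follow both at the cost of more noise.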

DNA methylation in cancer

Large blocks of hypo-methylation in colon cancer

  • overlaps with other important genomic domains
  • genes within these blocks are tissue-specific
Nat. Genetics, 2011

Genes with hyper-variable expression in colon cancer are enriched within these blocks.

Nat. Genetics, 2011
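
One simple way to check a claim of this kind is a 2x2 enrichment test. The sketch below is run entirely on simulated data; the variance-ratio score, the top-10% cutoff, and every variable name are assumptions for illustration and are not the analysis from the cited paper.

```r
# Simulated example only: gene counts, block membership, and variances are made up.
set.seed(2)
n_genes  <- 1000
n_normal <- 20
n_tumor  <- 20

# Hypothetical indicator: does the gene lie inside a hypo-methylation block?
in_block <- rbinom(n_genes, 1, 0.3) == 1

# Normals: same variance for every gene.
# Tumors: genes inside blocks are simulated with a larger standard deviation.
normal   <- matrix(rnorm(n_genes * n_normal, sd = 1), nrow = n_genes)
tumor_sd <- ifelse(in_block, 2, 1)
tumor    <- matrix(rnorm(n_genes * n_tumor, sd = tumor_sd), nrow = n_genes)

# Per-gene variance ratio (tumor / normal) as a simple hyper-variability score
var_ratio      <- apply(tumor, 1, var) / apply(normal, 1, var)
hyper_variable <- var_ratio > quantile(var_ratio, 0.9)   # top 10% of genes

# 2x2 table and one-sided Fisher's exact test for enrichment inside blocks
tab <- table(hyper_variable = hyper_variable, in_block = in_block)
ft  <- fisher.test(tab, alternative = "greater")
print(tab)
print(ft$p.value)
```

Because the simulation builds the effect in, the test only demonstrates the mechanics of the enrichment calculation, not any biological result.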

Hypo-methylation blocks observed across five solid tumor types.

Genome Medicine, 2014

Gene expression hyper-variability enriched in hypo-methylation blocks in other cancer types.

Genome Medicine, 2014

Genes with consistent hyper-variable expression across tumors are tissue-specific.

BMC Bioinformatics, 2013

Summary

  • large domains of methylation loss are a stable mark across cancer types
  • gene expression hyper-variability is enriched within these domains
  • hyper-variable genes within these regions are tissue-specific and involved in cellular fate

Software

  • State-of-the-art computational and statistical analysis platform
  • We develop and apply methods for these analyses in this platform
  • Our collaborators do analysis in this platform with us
  • metagenomeSeq
  • metagenomeFeatures
  • antiProfiles
  • minfi
  • bumphunter
  • HTShape
  • qsmooth
  • Rcplex
  • Rcsdp

Collaborative and exploratory analysis

  • Data transformation and modeling: data smoothing, region finding (R/Bioconductor: BSmooth, minfi)
  • Exploration: search by gene, search by overlap
  • Contextual analysis: overlap with other data (our own, other labs, UCSC, Ensembl)
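
The overlap searches listed above can be sketched in base R as follows. The coordinates, gene names, and the helper `find_overlaps` are hypothetical and exist only for this illustration; in practice the `findOverlaps` function from the GenomicRanges package is the usual tool.

```r
# Hypothetical regions of interest and gene annotations (coordinates are made up)
regions <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                      start = c(100, 5000, 300),
                      end   = c(900, 6000, 700))
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    chr   = c("chr1", "chr1", "chr2"),
                    start = c(800, 2000, 600),
                    end   = c(1500, 3000, 2000))

# Two closed intervals on the same chromosome overlap when each one starts
# before the other ends
find_overlaps <- function(query, subject) {
  hits <- merge(cbind(query_idx = seq_len(nrow(query)), query),
                cbind(subject_idx = seq_len(nrow(subject)), subject),
                by = "chr", suffixes = c(".q", ".s"))
  hits <- hits[hits$start.q <= hits$end.s & hits$start.s <= hits$end.q, ]
  hits[order(hits$query_idx, hits$subject_idx), c("query_idx", "subject_idx")]
}

# Search by overlap: all (region, gene) pairs that intersect
hits <- find_overlaps(regions, genes)
print(hits)

# Search by gene: which regions overlap geneA?
gene_a_regions <- regions[hits$query_idx[genes$gene[hits$subject_idx] == "geneA"], ]
print(gene_a_regions)
```

The all-pairs merge is fine for a handful of intervals but scales poorly; interval-tree based implementations are the better choice for genome-wide data.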

Genomic Data Science

  • We have an unprecedented ability to measure
  • and lots of publicly available data to put those measurements in context
[H. Wickham]

Integrative, visual and computational exploratory analysis of genomic data

  • Browser-based
  • Interactive
  • Integration of data
  • Reproducible dissemination
  • Communication with R/Bioconductor: epivizr package
e.g.: http://epiviz.cbcb.umd.edu/?ws=YOsu0RmUc9l
[Nat. Methods, 2014]

Creativity in exploration

We are building software applications to support creative exploratory analysis of large genome-wide datasets…

[T. Speed]

Summarization: summarize integrated measurements (computed on data subsets)

Statistically-guided exploration: Calculate a statistic of interest

# Get base-pair-level methylation data for the tumor sample
m <- assay(se)[,"tumor"]

# Compute a windowed statistic to find regions with the highest variability across CpGs
# (windows of 80 CpGs, moved 25 CpGs at a time)
region_stat <- calcWindowStat(m, step=25, window=80, stat=rowSds)
s <- region_stat[,"stat"]
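
To make the windowed statistic concrete, here is a minimal base-R sketch of what such a function could look like. `calc_window_stat` and `row_sds` are hypothetical stand-ins written for this illustration, and the choice to report the window midpoint as `indices` is an assumption; this is not the implementation used in the talk.

```r
# Hypothetical stand-in for a windowed statistic; not the talk's implementation.
# x:      numeric vector of per-CpG methylation values, ordered by position
# step:   number of CpGs between consecutive window starts
# window: number of CpGs per window
# stat:   function mapping a (windows x CpGs) matrix to one value per row
calc_window_stat <- function(x, step, window, stat) {
  starts <- seq(1, length(x) - window + 1, by = step)
  # One row per window, one column per CpG within the window
  idx <- outer(starts, seq_len(window) - 1, "+")
  w   <- matrix(x[idx], nrow = length(starts))
  # Assumption: report the CpG index at the middle of each window
  cbind(indices = starts + window %/% 2, stat = stat(w))
}

# Base-R replacement for a row-wise standard deviation
row_sds <- function(m) apply(m, 1, sd)

# Simulated methylation values with one highly variable stretch (CpGs 1001-1200)
set.seed(3)
meth <- c(rnorm(1000, 0.8, 0.02), rnorm(200, 0.5, 0.25), rnorm(1000, 0.8, 0.02))
meth <- pmin(pmax(meth, 0), 1)

region_stat <- calc_window_stat(meth, step = 25, window = 80, stat = row_sds)
s   <- region_stat[, "stat"]
top <- region_stat[order(s, decreasing = TRUE)[1], "indices"]
print(top)
```

Laying the windows out as rows of a matrix is what lets a row-wise function such as a row standard deviation act as the per-window statistic.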

Explore data based on statistic

What's around the regions with the highest variability across CpGs?

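# Get window locations in decreasing order of the statistic
o <- order(s, decreasing=TRUE)
indices <- region_stat[o, "indices"]
# Extend each region by 1.25 Mb on both sides to show genomic context
slideShowRegions <- rowRanges(se)[indices] + 1250000L
# Step through the regions in the browser
mgr$slideshow(slideShowRegions)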

Dynamically extensible: easily integrate new data types and add new visualizations.

  • Based on the classic "three-table" design in genomic data analysis
  • Data providers define coordinate space
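
A minimal sketch of the three-table layout in base R follows. All names and values are made up for illustration; Bioconductor's SummarizedExperiment class follows the same layout (an assay matrix, row annotation, and column annotation), which is why the snippets in this talk use `assay(se)` and `rowRanges(se)`.

```r
# Hypothetical three-table layout; all values and names are made up.
# 1) Measurements: features (rows) x samples (columns)
assay_tab <- matrix(c(0.9, 0.8, 0.4,
                      0.7, 0.9, 0.3,
                      0.2, 0.1, 0.6),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("cpg1", "cpg2", "cpg3"),
                                    c("normal1", "normal2", "tumor1")))

# 2) Feature table: one row per feature, giving its genomic coordinates
feature_tab <- data.frame(chr = c("chr1", "chr1", "chr2"),
                          pos = c(1000, 1500, 200),
                          row.names = rownames(assay_tab))

# 3) Sample table: one row per sample, giving its annotation
sample_tab <- data.frame(status = c("normal", "normal", "tumor"),
                         row.names = colnames(assay_tab))

# The tables are linked by shared row and column names
stopifnot(identical(rownames(assay_tab), rownames(feature_tab)),
          identical(colnames(assay_tab), rownames(sample_tab)))

# Example query: mean methylation of chr1 features in normal samples
sel_rows <- feature_tab$chr == "chr1"
sel_cols <- sample_tab$status == "normal"
chr1_normal_mean <- mean(assay_tab[sel_rows, sel_cols])
print(chr1_normal_mean)
```

The feature table is where the coordinate space lives, so a data provider that supplies its own coordinates only needs to fill in that table for its measurements to be aligned with everything else.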

Visualization design goals

  • Context
  • Integrate and align multiple data sources; navigate; search
  • Connect: brushing
  • Encode: map visualization properties to data on the fly
  • Reconfigure: multiple views of the same data
[Perer & Shneiderman]

Visualization goals

  • Data
  • Select and filter: tight-knit integration with R/Bioconductor
  • (current work) filters on visualization propagate to data environment
  • Model
  • New 'measurements' are the result of modeling, perhaps suggested by data context
[Perer & Shneiderman]

[H. Wickham]

One interpretation of Big Data is many relevant sources of contextual data

  • Easily access/integrate contextual data
  • Driven by exploratory analysis of immediate data
  • Iterative process
  • Visual and computational exploration go hand in hand

Acknowledgements

Justin Wagner, Jayaram Kancherla (CBCB)
Florin Chelaru (now at Google), Joseph Paulson (now at Genentech)
Feinberg Lab & K. Hansen (JHU), R. Irizarry (Harvard)

Funding: NIH, Genentech, Gates Foundation

More information

http://hcbravo.org
@hcorrada